Spotify Track Analysis Tutorial

By: Anuoluwapo Faboro, John Gansallo & Zikora Anyaoku

Table of contents:

  1. Introduction
    • About the data
    • Libraries Used
  2. Data Collection
    • Document Set Up
    • Accessing the Data
    • Tidying the Data
  3. Exploratory Data Analysis
  4. Hypothesis
  5. Conclusion

Introduction:

Music has never been more personalized and accessible. With the advent of streaming services such as Spotify and Apple Music, we want to see whether we can identify what makes songs popular on Spotify by looking at the audio features Spotify uses to characterize its tracks.

The dataset contains about 600,000 songs gathered from the Spotify Web API, with song, artist, and release date information as well as song qualities such as acousticness, danceability, loudness, tempo, and so on. The time span is from 1922 through 2020. When grabbing each track from the dataset, we can obtain track information such as track name, album, release date, length, and popularity. More importantly, Spotify's API allows us to extract a number of audio features, including danceability, energy, instrumentalness, liveness, loudness, speechiness, acousticness, and tempo.

We live in the Big Data era. We can collect large amounts of data, allowing us to derive useful conclusions and make well-informed strategic decisions. However, as the volume of data grows, analyzing and exploring it becomes more difficult. When utilized effectively and responsibly, visualizations can be powerful tools in exploratory data analysis. Visualizations can also be used to convey a message or inform our audience about our results. Because there is no one-size-fits-all approach to visualization, different tasks call for different types of visualizations. In this study, we'll look at the Spotify dataset, available on Kaggle.

About the Data:

This dataset contains 600,000+ tracks released from 1922 to the present. It includes each track's title, track id, release date, artist, and audio features. The data was obtained from Kaggle (https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks).

Libraries Used:

In order to perform the data analysis, we will need various Python libraries that allow us to access and parse the dataset. Some of the libraries we'll be using in this tutorial are listed below along with their purpose.
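A typical set of imports for a tutorial like this might look as follows (a representative sketch; the exact set of libraries, and extras such as seaborn or opendatasets, are assumptions rather than a list taken from the original code):

```python
import pandas as pd              # loading and manipulating the tabular tracks data
import numpy as np               # numerical operations on columns
import matplotlib.pyplot as plt  # plotting during exploratory data analysis

# Also commonly used in this kind of project (assumed, install separately):
#   opendatasets - downloading the Kaggle dataset with an API token
#   scipy / statsmodels - regression and hypothesis testing
```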


Data Collection

Accessing the Data and Set Up

Before beginning the data analysis, we need to access the dataset. We do this by providing opendatasets with the link to the Kaggle dataset and supplying a Kaggle API token. Once the dataset is downloaded, we use pandas.read_csv to access tracks.csv.
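The download-and-load step can be sketched as below. The download itself needs Kaggle credentials and network access, so it is wrapped in a function; the loading step is demonstrated on a tiny in-memory sample whose column names mirror tracks.csv (the sample values are illustrative, not real rows):

```python
import io
import pandas as pd

def download_tracks(dataset_url: str) -> None:
    """Download the Kaggle dataset; prompts for a Kaggle API token on first run."""
    import opendatasets as od  # imported lazily; requires `pip install opendatasets`
    od.download(dataset_url)

# Once downloaded, tracks.csv can be loaded with pandas. Illustrated here with a
# one-row in-memory sample shaped like the real file:
sample_csv = io.StringIO(
    "id,name,popularity,duration_ms,artists,release_date,danceability,tempo\n"
    "4BJqT0P,Carve,6,126903,\"['Uli']\",1922-02-22,0.645,104.851\n"
)
tracks = pd.read_csv(sample_csv)
print(tracks.shape)  # (1, 8) -- the real dataset is (586672, 20)
```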

After loading the data, we discover that there are 586,672 rows (in this case, 586,672 tracks) and 20 columns (those tracks' attributes) within the tracks dataset. Due to its large size, we will not be using the entirety of the dataset, so some of the data will be left out of our analysis.

Tidying the Data

Oftentimes datasets don't come in nice, prepared packages ready for analysis. Before we start examining the data, we need to clean it, because missing data, formatting issues, and other problems could impact our analysis.

First, we want to identify if there is any missing data within our dataset.

Datasets can often have missing data, which can impact our analysis. Before we begin analyzing the data, we want to check whether the tracks dataset has any missing values; if it does, we have to consider the circumstances under which the data is missing in order to decide the steps we take moving forward.
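A minimal sketch of this check, using a tiny stand-in dataframe in place of the real 586,672-row one:

```python
import pandas as pd
import numpy as np

# Tiny stand-in for the tracks dataframe (values are illustrative only):
tracks = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "name": ["Carve", np.nan, "Danny Boy"],
    "popularity": [6, 0, 3],
})

# Count missing values in each column.
missing = tracks.isna().sum()
print(missing)
# In the full dataset, only the `name` column shows missing values (72 of them).
```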

From the output above, we notice that 72 names are missing, but the other columns within the dataframe do not contain any missing values. We can assume the data is missing completely at random (MCAR) for several reasons, provided below:

  1. Only the name column has missing data.
  2. The id column is a unique identifier for the track, so even if the name is missing, one can still recover the name by querying Spotify's API with the track id.
  3. There's no relationship between whether a data point is missing and any values in the dataset, missing or observed.
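Because the data is MCAR and the id column still identifies each track, one reasonable way to handle the 72 nameless rows is simply to drop them. This is a sketch of one option, not necessarily the step the analysis actually took:

```python
import pandas as pd
import numpy as np

# Stand-in dataframe; the middle row plays the role of a nameless track.
tracks = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "name": ["Carve", np.nan, "Danny Boy"],
})

# Drop rows whose name is missing; ids remain available if we ever want
# to recover the names from Spotify's API later.
tracks_clean = tracks.dropna(subset=["name"])
print(len(tracks_clean))  # 2
```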

Next, we want to make sure that our data is properly formatted.

Based on the initial dataset, some columns within the tracks dataset need to be cleaned. For example, the release date isn't consistent: some rows include the day and month of the track's release while others include only the year. To streamline the data, we will consider only the year of release.
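One simple way to normalize the mixed date formats (an assumption about the approach; it relies on both formats starting with a four-digit year):

```python
import pandas as pd

# Stand-in column mixing full dates and bare years, as in the real data:
tracks = pd.DataFrame({"release_date": ["1922-02-22", "1999", "2010-06-01"]})

# The first four characters are the year in both formats.
tracks["release_year"] = tracks["release_date"].str[:4].astype(int)
print(tracks["release_year"].tolist())  # [1922, 1999, 2010]
```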

Because there is only one track that was released in 1900, we will drop that row, as it's an outlier that could affect our analysis.
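Dropping that outlier is a one-line boolean filter (sketched on stand-in data):

```python
import pandas as pd

tracks = pd.DataFrame({
    "name": ["Lone 1900 Track", "Newer Track"],
    "release_year": [1900, 1985],
})

# Keep every row except the single 1900 outlier.
tracks = tracks[tracks["release_year"] != 1900]
print(len(tracks))  # 1
```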

To make the duration of a track more readable, we will convert the column duration_ms from milliseconds to minutes.
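Since one minute is 60,000 milliseconds, the conversion is a single division (the `duration_min` column name is our choice for illustration):

```python
import pandas as pd

tracks = pd.DataFrame({"duration_ms": [126903, 240000]})

# Convert duration from milliseconds to minutes for readability.
tracks["duration_min"] = tracks["duration_ms"] / 60000
print(tracks["duration_min"].round(2).tolist())
```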

To make it easier for us to parse through the artist data, we will convert the artists column to a list rather than a Pandas Series.
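In tracks.csv the artists column is stored as a string that merely looks like a list (e.g. "['Uli']"), so it needs to be parsed into a real Python list. `ast.literal_eval` is one safe way to do that:

```python
import ast
import pandas as pd

# Stand-in for the artists column as it arrives from the CSV:
tracks = pd.DataFrame({"artists": ["['Uli']", "['Jay-Z', 'Beyoncé']"]})

# Parse each list-like string into an actual Python list so individual
# artists can be accessed directly.
tracks["artists"] = tracks["artists"].apply(ast.literal_eval)
print(tracks["artists"].iloc[1])  # ['Jay-Z', 'Beyoncé']
```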


Exploratory Data Analysis

In this step, we examine the relationship between the popularity of songs and the audio features identified above. Exploratory data analysis is a chance to examine the dataset and identify patterns within it. Occasionally, while examining the dataset, you may find that different features are not related at all.

Attribute Categories

We've looked at the relationship between popularity and danceability among tracks over the ten years from 2010 to 2019; however, there is no apparent relationship. From this set of graphs we can surmise that danceability has no effect on the popularity of a track on Spotify. Next, we'll see if there's a relationship between popularity and the tempo of a track.

After plotting the tempo and popularity of tracks released between 2010 and 2019, there also appears to be no relationship, so tempo does not seem to be a factor in what makes a song popular on Spotify. Next, we'll see if the duration of a track has an impact on a song's popularity. However, based on the trend we're seeing so far, there may not be a relationship between the duration of a song and its popularity.
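Each of these feature-vs-popularity plots can be produced with a small helper along these lines (a sketch using synthetic stand-in data; the real analysis plots the tracks dataframe, typically one panel per year):

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the 2010-2019 tracks (illustrative values only):
rng = np.random.default_rng(0)
tracks = pd.DataFrame({
    "popularity": rng.integers(0, 100, 200),
    "danceability": rng.uniform(0.0, 1.0, 200),
    "tempo": rng.uniform(60.0, 200.0, 200),
})

def plot_feature_vs_popularity(df, feature):
    """Scatter plot of one audio feature against track popularity."""
    fig, ax = plt.subplots()
    ax.scatter(df[feature], df["popularity"], s=8, alpha=0.5)
    ax.set_xlabel(feature)
    ax.set_ylabel("popularity")
    ax.set_title(f"Popularity vs. {feature}")
    return ax

ax = plot_feature_vs_popularity(tracks, "danceability")
```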


Hypothesis Testing

We now want to put our theory to the test. First, let's get a clear picture of what hypothesis testing entails. Hypothesis testing is a statistical approach for assessing whether a model you've constructed is a good fit. It involves two hypotheses: a null hypothesis and an alternative hypothesis. The objective is to set up your experiment in such a way that the null hypothesis can be rejected. So what does rejecting a hypothesis mean? This is where the significance level comes into play: in addition to the hypotheses, you must choose a significance level while planning your experiment. You reject the null hypothesis if the p-value of your test statistic is less than your significance level, also known as the rejection level.

We want to collect data on every song Beyoncé has ever released for our study. The API can be used to produce such a dataset in a few different ways. We could get a list of the artist's albums and then loop over each album's tracks. Alternatively, we could cycle through a playlist we find on Spotify that contains every track Beyoncé has to offer, which would possibly be more efficient.
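The album-by-album loop can be sketched as below. It assumes a client object with `artist_albums(artist_id)` and `album_tracks(album_id)` methods, which is a simplified stand-in for a real Spotify client such as spotipy (whose methods return paginated responses and need credentials); the stub here only exists so the loop can run without API access:

```python
def collect_artist_tracks(client, artist_id):
    """Loop over an artist's albums and gather every track."""
    tracks = []
    for album in client.artist_albums(artist_id):
        tracks.extend(client.album_tracks(album["id"]))
    return tracks

# Minimal stub standing in for a real Spotify API client:
class FakeClient:
    def artist_albums(self, artist_id):
        return [{"id": "album1"}, {"id": "album2"}]

    def album_tracks(self, album_id):
        return [{"name": f"{album_id}-track{i}"} for i in (1, 2)]

songs = collect_artist_tracks(FakeClient(), "beyonce_artist_id")
print(len(songs))  # 4
```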

Based on Beyoncé's discography from 2002 to the present, our null hypothesis is that there is no relationship between her songs' popularity on Spotify and their audio features. Our alternative hypothesis is that there is a relationship.

We are going to perform ordinary least squares (OLS) linear regression on our dataset in order to test our null hypothesis.
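As a simplified sketch of the idea, here is a least-squares regression of popularity on a single feature using scipy's `linregress` (the actual analysis regresses on all audio features at once via an OLS summary; the synthetic, unrelated data below is an assumption used only to illustrate the p-value decision rule):

```python
import numpy as np
from scipy.stats import linregress

# Synthetic stand-in: popularity and danceability drawn independently,
# so no real relationship exists between them.
rng = np.random.default_rng(42)
danceability = rng.uniform(0.0, 1.0, 100)
popularity = rng.integers(20, 90, 100).astype(float)

result = linregress(danceability, popularity)
print(f"slope={result.slope:.3f}, p-value={result.pvalue:.3f}")

# Apply the 5% significance level from the hypothesis-testing setup:
if result.pvalue < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```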

The p-values within the OLS summary are all well above the 5% rejection level, so there is no evidence to suggest a relationship between popularity and the songs' audio features; we fail to reject the null hypothesis.

Conclusion

We were looking for a relationship between Spotify's audio features and a song's popularity. We specifically looked at the features of Beyoncé's tracks to see if we could identify a pattern, but unfortunately we were unable to. That doesn't mean there aren't other patterns; it just means this particular pattern doesn't exist in our data. A step that can be taken in the future is incorporating machine learning to see how a system performs when identifying patterns and making decisions with little human interaction. All of the code on this webpage can be obtained from GitHub.

Resources

Spotify: https://developer.spotify.com/documentation/web-api/reference/#object-audiofeaturesobject

Visualization:
https://www.fusioncharts.com/blog/best-python-data-visualization-libraries/
https://towardsdatascience.com/visualizing-spotify-songs-with-python-an-exploratory-data-analysis-fc3fae3c2c09

Machine Learning to pursue in future:
https://medium.com/analytics-vidhya/predicting-song-popularity-71bc3b067237
http://www.math.utah.edu/~gustafso/s2018/2270/projects-2018/submittedprojects/sorenNelson/Spotify's%20Collaborative%20Filtering.pdf